Empirical performance evaluations of parallel, discrete event simulation
algorithms using deadlock avoidance and deadlock detection and recovery
techniques developed by Chandy and Misra have been performed using the
BBN Butterfly multiprocessor. Experiments using synthetic workloads
reveal that the degree to which processes can look ahead in simulated time
plays a critical role in the performance of distributed simulation using
these algorithms. These results are applied to a queueing network
simulation where as much as an order of magnitude improvement in
performance is observed if the distributed simulator is programmed to fully
exploit the lookahead available in the application. Performance
measurements of several hypercube-based communication network
simulators provide additional empirical data to support these claims.
These results demonstrate that substantial improvements in performance
are obtainable if the application can be programmed to have good
lookahead characteristics. On the other hand, other applications inherently
contain poor lookahead properties, and appear to be ill-suited for these
simulation algorithms.
Discrete event simulation has long been a task with computation
requirements that challenge the fastest available computers. For example,
simulations of communication networks, parallel computer architectures,
and battlefield scenarios often require hours, days, or even weeks of CPU
time using traditional, single processor techniques. Simulator performance
may be improved using vectorizing techniques [Chan83a], processors
dedicated to specific simulation functions [Comf84a], execution of
independent trials on separate processors [Bile85a], or the execution of a
single instance of a simulation program on a parallel computer. The last
technique, referred to as distributed simulation, is the subject of this paper.

Simulation  would  initially  appear  to  be  a  natural  candidate  for  parallel 
processing  because  many  of  the  aforementioned  applications  contain  a 
high  degree  of  parallelism.  However,  the  exploitation  of  this  parallelism  is 
elusive  because  the  global  notion  of  simulated  time  does  not  easily  map 
onto a distributed computer. This property distinguishes distributed
simulation from other forms of parallel computation.

Several  schemes  have  been  proposed  to  solve  this  problem.  A  survey  of 
the literature has been reported by Kaudel [Kaud87a]. One important class
of  distributed  simulation  algorithms  is  the  so-called  “conservative” 
mechanisms.  Chandy  and  Misra  developed  a  mechanism  based  on  a 
deadlock  avoidance  technique  where  null  messages  are  used  to  distribute 
clock  information  among  the  processes  taking  part  in  the  simulation 
[Chan79a, Misr86a]. Another mechanism, also developed by Chandy and
Misra, is based on a deadlock detection and recovery paradigm: the
simulator runs until deadlock, the deadlock is detected, and an algorithm is
executed to break the deadlock [Chan81a, Misr86a]. Other approaches to
distributed simulation have been proposed, notably the Time Warp
approach proposed by Jefferson [Jeff85a], but the work discussed here will
be  confined  to  deadlock  avoidance  and  deadlock  detection  and  recovery 
techniques. 

In [Fuji88a], several experiments using synthetic workloads were
described that were designed to evaluate the effectiveness of distributed
simulation strategies using the deadlock avoidance and the deadlock
detection and recovery algorithms. These experiments were performed on
a distributed simulation testbed that was implemented on the BBN
Butterfly, a shared-memory multiprocessor. Here, we apply these
results to specific application problems to provide empirical data to support
them. In particular, parallel simulations of queueing networks and
the communication subsystem of a hypercube-based multicomputer
demonstrate the relationship between lookahead in the simulation
application and performance of the parallel simulator.
2. Logical Processes, Activities, and Lookahead

Logical  processes,  activities,  and  lookahead  form  the  basis  for  the 
synthetic workload model that is used here. The simulation program
consists  of  some  number  of  logical  processes,  each  of  which  models  some 
portion  of  the  system  being  simulated.  For  example,  in  simulating  a  digital 
logic  network,  each  gate  (or  some  collection  of  gates)  could  be  modeled  by 
a logical process. Logical processes communicate exclusively by
exchanging timestamped messages. Messages typically correspond to
events that trigger a change in system state. Each logical process must
process incoming messages in non-decreasing timestamp order to ensure
that  cause-and-effect  relationships  are  faithfully  reproduced  by  the 
simulator. 

We  informally  define  an  activity  as  a  sequence  or  thread  of  events  that 
propagates  among  the  logical  processes  in  the  simulation.  These  events 
model  some  sequence  of  cause-and-effect  relationships  in  the  system  being 
simulated.  For  example,  in  a  logic  simulation,  individual  events  are  logic 
signal  transitions  and  each  activity  corresponds  to  a  signal  propagating 
through  a  sequence  of  logic  gates.  In  a  queueing  network  simulation,  each 
activity  corresponds  to  a  job  traveling  through  the  network.  Activities  are 
usually  dynamic.  A  new  activity  is  created  in  the  logic  simulation 
whenever  an  existing  activity  reaches  a  fanout  point  in  the  network.  The 
activity  disappears  when  (for  instance)  it  reaches  an  AND  gate  with  a  logic 
zero  on  one  of  the  other  input  lines.  For  our  purposes,  this  informal 
definition  of  activities  and  logical  processes  will  suffice. 

Logical  processes  often  "look  ahead"  into  the  simulated  time  future  to 
schedule new events. For example, upon receiving a signal transition
event in a logical process for an inverter gate, the process can predict and
schedule a new event (a signal transition at the output of the gate) one gate
delay  later  in  simulated  time.  The  lookahead  abilities  of  the  process 
determine  how  readily  it  will  schedule  new  events.  Processes  such  as  the 
inverter  with  good  lookahead  abilities  can  "see"  sufficiently  far  into  the 
future that "effect" events can be scheduled as soon as the "cause" event
is received. On the other hand, processes with poor lookahead ability must
first wait until simulated time is advanced before they can schedule the
effect event. For example, in a queueing network simulation with
prioritized jobs, the "departure" event for a low priority job cannot be
scheduled until it is first determined that no higher priority job will
preempt it.

Quantitatively, lookahead is defined as follows: if a process has
knowledge of all events that will occur up to simulated time T, and can
predict all new events it will generate with timestamp T+L or less, then
the process is said to have lookahead L. In general, lookahead is a
complex function that varies with time and the type of event and is highly
dependent on details of the simulation problem and the way it is
programmed. A process can schedule a future event so long as the
timestamp on that event is less than or equal to the process's local clock
plus its lookahead. Such events are said to be within the "lookahead
horizon" of the process.
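The lookahead horizon test can be written down directly. The following Python sketch is only an illustration of the definition above; the function names are hypothetical and do not come from the testbed.

```python
def lookahead_horizon(local_clock: float, lookahead: float) -> float:
    """Largest timestamp the process may assign to a newly scheduled event."""
    return local_clock + lookahead


def can_schedule(event_timestamp: float, local_clock: float, lookahead: float) -> bool:
    """True if the event lies within the lookahead horizon of the process."""
    return event_timestamp <= lookahead_horizon(local_clock, lookahead)
```

With a local clock of 10 and a lookahead of 2, an event stamped 12 can be scheduled immediately, while an event stamped 12.5 must wait until the local clock advances.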

Consider a "cause" event with timestamp $T_{cause}$ that leads to an
"effect" event with timestamp $T_{effect}$. The absolute value of lookahead is
not as important as the lookahead relative to $T_{effect} - T_{cause}$, because this
will determine how far the process must advance in simulated time to
generate the new event. Therefore, we define a quantity referred to as the
lookahead ratio (LAR):

$$\mathrm{LAR} = \frac{T_{effect} - T_{cause}}{\text{lookahead}}$$

A low (e.g., 1.0) LAR corresponds to a high degree of lookahead.
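As a small worked example, the Python sketch below computes the lookahead ratio under the assumption that it is the timestamp increment between the cause and effect events divided by the lookahead. The function name and the treatment of zero lookahead as an infinite ratio are illustrative choices, not taken from the testbed.

```python
import math


def lookahead_ratio(t_cause: float, t_effect: float, lookahead: float) -> float:
    """Timestamp increment from cause to effect, relative to the lookahead."""
    if lookahead <= 0.0:
        # Assumption: a process with no lookahead is given an infinite ratio.
        return math.inf
    return (t_effect - t_cause) / lookahead
```

A process that can see the full distance from cause to effect has a ratio of 1.0; one that can see only a quarter of that distance has a ratio of 4.0.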

3. The Distributed Simulation Testbed
An 18 processor BBN Butterfly multiprocessor was used for
experimentation. Each processor node contains a 16 MHz MC68020 with
MC68881 floating point coprocessor, 1 to 4 MBytes of memory, and a
processor node controller (PNC), a microcoded engine that processes local
and remote memory requests. The interconnection switch is configured as
an Omega network. Atomic test-and-set like memory operations are also
implemented in the PNC. Execution times of various instructions and
operations are shown in table 1. Experimental data indicate that switch
contention, and hot spot congestion in particular, is unlikely [Thom86a].

Table 1. Hardware Parameters

Operation                           Execution Time (microseconds)
Local memory reference              0.60
Remote memory reference             4.0
Register-to-register instruction    0.71
16 bit Load (Local Memory)          1.3
16 bit Load (Remote Memory)         6.3
Parameterless function call         6.9
Atomic inclusive OR                 20

Each  processor  executes  a  single  operating  system  process.  This 
process  is  a  scheduler  that  time  multiplexes  execution  of  the  simulation 
processes  mapped  to  the  processor.  This  strategy  avoids  excessive  context 
switching  overhead,  and  allows  more  direct  control  over  the  process 
scheduling mechanism. Asynchronous message passing primitives were
constructed using direct memory accesses to the mailbox in the receiving
simulator process. Only a few simple Butterfly primitives, namely lock
and atomic-add operations, are used by the testbed after initialization is
complete. 


4. The Simulation Algorithms

Two  distributed  simulation  algorithms  were  implemented  in  the  testbed: 
one based on deadlock avoidance and another based on deadlock detection
and recovery. The shared memory architecture of the Butterfly was used
to improve the efficiency of these algorithms, as described below. A single
processor,  event  list  implementation  was  also  developed  in  order  to 
compute  speedup. 

4.1 Deadlock Avoidance Strategy

The deadlock avoidance scheme developed by Chandy and Misra was
implemented  first.  Each  logical  process  sends  a  null  message  to  each  of  its 
neighbors  whenever  it  blocks.  The  timestamp  on  this  message  represents  a 
lower  bound  of  the  timestamp  on  any  message  that  will  be  sent  to  the 
receiver  in  the  future.  It  is  equal  to  the  local  clock  value  of  the  process 
plus  the  lookahead  value  because,  by  definition,  the  process  cannot  predict 
the  occurrence  (or  non-occurrence)  of  events  further  into  the  future. 
Chandy  and  Misra  have  shown  that  this  approach  is  sufficient  to  avoid 
deadlock  [Chan79a]. 

In  the  testbed,  one  optimization  was  performed  to  streamline  the 
processing  of  null  messages.  Rather  than  enqueueing  each  null  message 
sent  to  another  processor,  a  single  variable  is  associated  with  each  input 
link  that  contains  the  timestamp  of  the  last  null  message  that  was  received. 
This  avoids  unnecessary  enqueue  and  dequeue  operations  and  leads  to 
more  efficient  memory  utilization. 
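The null message rule and the per-link variable can be sketched as follows. This is a simplified Python illustration, not the testbed code: the class and function names are hypothetical, and a plain deque stands in for the mailbox of real messages.

```python
from collections import deque


class InputLink:
    """One incoming link of a logical process."""

    def __init__(self):
        self.messages = deque()  # real (non-null) messages: (timestamp, payload)
        self.last_null = 0.0     # timestamp of the last null message received

    def receive_real(self, timestamp, payload):
        self.messages.append((timestamp, payload))

    def receive_null(self, timestamp):
        # Null messages are never enqueued; they only overwrite one variable.
        self.last_null = timestamp

    def lower_bound(self):
        """Lower bound on the timestamp of the next message from this link."""
        if self.messages:
            return self.messages[0][0]
        return self.last_null


def null_message_timestamp(local_clock, lookahead):
    """Timestamp a blocking process places on its outgoing null messages."""
    return local_clock + lookahead


def safe_time(links):
    """The process may safely advance its clock up to this simulated time."""
    return min(link.lower_bound() for link in links)
```

A process blocks when the link with the smallest lower bound has no real message queued; a later null message on that link raises the bound and may unblock it.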

4.2 Deadlock Detection and Recovery Strategy

The  second  simulation  approach  is  based  on  deadlock  detection  and 
recovery. The simulation runs until deadlock, the deadlock is detected,
and an algorithm is initiated to break the deadlock [Chan81a]. A central
controller  is  used  to  coordinate  the  deadlock  recovery  procedure. 

Deadlock  in  the  testbed  is  easily  detected  by  maintaining  a  global 
counter  indicating  the  number  of  processes  that  are  either  scheduled  or 
running.  The  system  is  deadlocked  whenever  the  counter  reaches  zero  and 
there  is  at  least  one  process  that  has  not  yet  terminated  (otherwise,  the 
computation has terminated). Each scheduler checks the deadlock counter
whenever it fails to find a process to run, and initiates a computation to
break  the  deadlock  if  it  finds  the  counter  is  zero. 
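The counter-based detection scheme can be illustrated with a short Python sketch. It assumes a threading.Lock in place of the Butterfly's atomic operations, and the class and method names are hypothetical.

```python
import threading


class DeadlockCounter:
    """Global count of processes that are either scheduled or running."""

    def __init__(self, total_processes):
        self._lock = threading.Lock()
        self._active = 0                    # scheduled or running processes
        self._unfinished = total_processes  # processes not yet terminated

    def process_scheduled(self):
        with self._lock:
            self._active += 1

    def process_blocked(self):
        with self._lock:
            self._active -= 1

    def process_terminated(self):
        with self._lock:
            self._active -= 1
            self._unfinished -= 1

    def is_deadlocked(self):
        # Deadlock: nothing can run, yet some process has not terminated.
        with self._lock:
            return self._active == 0 and self._unfinished > 0

    def is_finished(self):
        with self._lock:
            return self._unfinished == 0
```

In this sketch a scheduler that finds nothing to run would call is_deadlocked() and, if it returns True, start the recovery computation.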

The deadlock recovery algorithm locates the message in the system with
the smallest timestamp and arranges for it to be processed next. A
distributed algorithm is used to perform this computation. A central
controller is used to coordinate this activity. By convention, the scheduler
executing  on  PE  0  acts  as  the  controller. 
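The core of the recovery step, finding the smallest timestamp in the system, can be sketched as below. This is a sequential Python stand-in for the distributed computation, under the assumption that each scheduler can report the timestamps of its unprocessed messages to the controller; the function and parameter names are hypothetical.

```python
def find_recovery_target(pending):
    """Locate the unprocessed message with the smallest timestamp.

    pending maps a process id to the timestamps of its unprocessed messages.
    Returns (process_id, timestamp), or None if no messages remain.
    """
    best = None
    for process_id, timestamps in pending.items():
        if not timestamps:
            continue
        candidate = (min(timestamps), process_id)  # local minimum for this process
        if best is None or candidate < best:
            best = candidate
    if best is None:
        return None
    timestamp, process_id = best
    return process_id, timestamp
```

The process returned here is the one the controller would restart, since no message with a smaller timestamp can arrive.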

An alternative deadlock recovery algorithm was also implemented in
which messages are propagated throughout the system in order to restart as
many processes as possible. This algorithm is described in [Chan81a]. It
was found, however, that the additional time required to execute this
algorithm yielded a net loss in performance. The performance figures
reported here are based on the former deadlock recovery approach.

4.3 Uniprocessor Simulation Algorithm
Finally,  a  single  processor,  event  list  simulator  was  developed  to  allow 
comparison of distributed simulation programs with sequential event list
implementations. In order to obtain a fair comparison, the uniprocessor
simulator was constructed by modifying the distributed simulator. Both
implementations maintain the same overall structure, organization,
programming  style,  and  conventions.  All  code  specific  to  parallel 
computation  (e.g.,  synchronization  locks)  was  eliminated. 

The event list was implemented as a splay tree [Slea85a]. Empirical
evidence suggests that splay trees are among the fastest methods for
implementing an event list [Jone86a]. An alternative implementation
using a singly linked linear list was also developed. It was found that this
implementation yielded performance comparable to the splay tree for small
simulations but, as expected, ran much more slowly for the larger
simulations. The splay tree implementation is used in all comparisons with
uniprocessor simulations reported here.
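As a rough illustration of a sequential event list, the Python sketch below keeps pending events ordered by timestamp. It uses the standard library's binary heap (heapq) as a stand-in, not the splay tree used in the testbed, and the class and method names are hypothetical.

```python
import heapq
import itertools


class EventList:
    """Pending events ordered by timestamp (binary heap, not a splay tree)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # breaks ties between equal timestamps

    def schedule(self, timestamp, event):
        heapq.heappush(self._heap, (timestamp, next(self._order), event))

    def next_event(self):
        """Remove and return the (timestamp, event) pair with the smallest timestamp."""
        timestamp, _, event = heapq.heappop(self._heap)
        return timestamp, event

    def __len__(self):
        return len(self._heap)
```

Events with equal timestamps are returned in the order in which they were scheduled.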

4.4  Performance  Metrics 

Three  metrics  are  defined  to  evaluate  the  performance  of  the  distributed 
simulation  programs: 

•  Speedup.  SU(n),  the  speedup  using  n  processors,  is  defined  as  the 
execution  time  of  the  single  processor,  event  list  implementation  using  a 
splay  tree  divided  by  the  execution  time  of  the  distributed  simulation 
program  when  n  processors  are  used. 

•  Null Message Ratio. NMR is defined as the number of null messages
processed  by  the  simulator  using  deadlock  avoidance  divided  by  the 
number  of  real  (non-null)  messages  processed.  This  measures  the 
overhead  of  the  deadlock  avoidance  approach. 

•  Deadlock  Ratio.  DR  is  the  number  of  messages  processed  by  the 
distributed  simulator  using  deadlock  detection  and  recovery,  divided  by 
the  number  of  deadlocks  that  occur.  This  figure  measures  the  efficiency 
of  the  deadlock  detection  and  recovery  algorithm. 
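The three metrics reduce to simple ratios, as the Python sketch below shows. The function names are hypothetical, and returning infinity when no deadlocks occur is an assumption made here for illustration.

```python
import math


def speedup(uniprocessor_time, distributed_time):
    """SU(n): sequential event list time over time on n processors."""
    return uniprocessor_time / distributed_time


def null_message_ratio(null_messages, real_messages):
    """NMR: null messages processed per real (non-null) message processed."""
    return null_messages / real_messages


def deadlock_ratio(messages_processed, deadlocks):
    """DR: messages processed per deadlock (higher is better)."""
    if deadlocks == 0:
        return math.inf  # assumption: no deadlocks means no recovery overhead
    return messages_processed / deadlocks
```

For instance, a run that processes 1000 messages and deadlocks 4 times has a deadlock ratio of 250.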

The  single  processor  execution  times  were  obtained  by  running  the  splay 
tree simulator on a single node of the Butterfly. The same compiler as that
used by the distributed simulator was used. Therefore, compiler and
processor speed dependencies are factored out of the speedup figures.

The experiments were performed with no other applications running on
the Butterfly. Facilities, such as the window manager, were run on
processors  different  from  those  executing  the  simulation  program.  These 
measures  were  taken  to  minimize  interference  with  the  computation. 

Experimental  data  were,  for  the  most  part,  well  behaved.  The  95 
percent  confidence  intervals  for  the  measured  data  were  typically  less  than 
one  or  two  percent  of  the  reported  value.  Only  in  a  few  instances  were 
significant variations observed from one measurement to another. These
were  related  to  the  avalanche  effect  described  later,  and  do  not  affect  the 
conclusions  that  follow  from  these  experiments. 

5. Experiments Using Synthetic Workloads

Synthetic workloads were constructed based on the notions of logical
processes, activities, and lookahead, described earlier. Workloads
contained 16 and 64 logical processes organized in 4 by 4 and 8 by 8
toroids, respectively (a toroid is a nearest neighbor mesh with wrap-around
edge connections). Toroids were used because they do not contain
inherent bottlenecks that might color the results, and because they are rich
in cycles, and therefore represent a reasonably challenging configuration
for the simulation algorithms. It is assumed that the number of activities in
the simulation remains constant, and the lookahead of each process
remains fixed throughout the simulation and does not depend on the type
of event. Within each experiment, a fixed number of messages (the
message population) circulates in a manner similar to jobs traveling
throughout a closed queueing network. Simulation activity in each process
was emulated using busy wait loops.
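The toroid topology can be generated with a few lines of Python. The sketch below is only an illustration of a nearest neighbor mesh with wrap-around edges; the function name and the row-major numbering of processes are assumptions.

```python
def toroid_neighbors(rows, cols):
    """Map each process (numbered row-major) to its four wrap-around neighbors."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            neighbors[node] = sorted({
                ((r - 1) % rows) * cols + c,  # up, wrapping to the last row
                ((r + 1) % rows) * cols + c,  # down, wrapping to the first row
                r * cols + (c - 1) % cols,    # left, wrapping to the last column
                r * cols + (c + 1) % cols,    # right, wrapping to the first column
            })
    return neighbors
```

Calling toroid_neighbors(4, 4) and toroid_neighbors(8, 8) gives the 16 and 64 process configurations, in which every process has exactly four neighbors.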

The  experiments  discussed  next  assume  a  message  population  of  four 
messages  per  process  and  an  average  computation  time  of  1  millisecond 
(selected  from  a  random  variable  with  a  negative  exponential  distribution) 
to process each incoming message. A static process to processor mapping
was used that balanced the workload assigned to the available processors
while minimizing interprocessor communications.

Figure 1. Speedup of synthetic workload as lookahead is varied (speedup using deadlock avoidance).

Numerous  experiments  were  conducted  to  examine  the  effects  of 
computation granularity, dynamic load balancing, message population,
message routing, and other factors. A detailed description of these results
is beyond the scope of the present discussion, but is given elsewhere
[Fuji87a, Fuji88a]. We will summarize some of these results and discuss
how they can be applied to a specific application.

5.1  Effect  of  Lookahead 

The  speedup  curves  in  figure  1  show  the  effect  of  varying  lookahead  in 
the deadlock avoidance simulator. As can be seen, lookahead plays a
critical role in determining simulator performance. Performance degrades
significantly as the lookahead ability of each process is reduced. Processes
with poor lookahead characteristics must delay generating new events,
reducing  the  amount  of  parallelism  available  in  the  simulation. 

Performance  of  the  16  node  toroid  is  somewhat  less  than  the  64  node 
toroid because the simulation does not contain sufficient parallelism to
keep all of the processors busy. In addition, as the number of processes
per processor is decreased, each process is afforded less time to collect
messages before it is executed by the scheduler. As a result, a process may
be scheduled more often than if there were more processes mapped to the
processor. The additional scheduling overhead and increased idle time
lead to poorer performance in the 16 node simulator, particularly as the
number  of  processors  is  increased. 

5.2 Message Avalanche

Experiments  using  the  deadlock  detection  and  recovery  strategy  also 
revealed an "avalanche" phenomenon. This behavior is depicted in figure
2 where the deadlock ratio is plotted as a function of the message
population. Performance remains poor (only a few messages processed
between deadlocks) at low and moderate message populations, but then
increases dramatically once message population reaches a certain critical
level. It was found that message avalanche was a prerequisite for
achieving good performance for this simulation strategy.

Message  avalanche  occurs  when  a  message  arriving  at  a  process  causes 
the  transmission  of  one  or  more  additional  messages,  which  in  turn  trigger 
the  transmission  of  still  others,  and  so  on.  A  multiplicative  effect  occurs 
whereby an "avalanche" of message traffic results from the original,
accounting for the dramatic improvement in simulator efficiency.

As  shown  in  figure  2,  the  message  population  required  to  induce 
avalanche  was  found  to  be  dependent  on  the  lookahead  ability  of  the 
processes. Smaller populations were required to induce avalanche if
processes were able to see far into the simulated future. This is again
because poor lookahead characteristics reduce the amount of parallelism in
the simulator.


Figure 2. Message avalanche occurs as the message population is increased
(deadlock detection and recovery strategy).

5.3 Processes with Different Lookaheads

The  experiments  described  above  used  homogeneous  workloads  where 
each  process  behaved  in  the  same  way  as  the  others.  Many  real 
simulations  contain  a  variety  of  logical  processes  with  different  lookahead 
characteristics.  Additional  experiments  were  performed  in  which  some 
processes  had  poorer  lookahead  characteristics  than  the  others. 

Figures 3 and 4 show simulator overhead for the deadlock detection and
recovery, and deadlock avoidance simulators, respectively, when some
number of processes with poor lookahead characteristics are mixed with
processes with good lookahead characteristics. Experiments were
performed in which one, one fourth, one half, and finally all processes
have poor lookahead (high LAR). Figure 3 indicates that the presence of a
few processes with poor lookahead results in a perceivable performance
degradation in the deadlock detection and recovery simulator (the
avalanche point is moved to higher message populations). When a
significant fraction of the processes have poor lookahead, performance is
almost the same as that when all processes have poor lookahead. The
deadlock avoidance simulator was found not to be as susceptible to such
behavior (see figure 4), though some degradation results if a sufficiently
high fraction have poor lookahead properties.

6. Queueing Network Simulations

To illustrate the applicability of the above results in a specific
application, queueing network simulations were performed. A five
process, central server network was simulated on the testbed. As shown in
figure 5, this network contains three first-come-first-serve (FCFS)
processes that service incoming jobs in the order in which they arrive, a
fork process that stochastically routes each incoming job to one of its
output ports (assume for now that either port is equally likely to be
selected), and a merge process that combines streams of incoming jobs into
a single output stream. Each server process also computes the average
number of jobs in the server and reports this figure to the user.

Simulation and empirical studies by Seethalakshmi and Reed,
respectively, concluded that the central server network is ill-suited for the
conservative distributed simulation algorithms discussed here
[Seet79a, Reed88a]. We reproduce and explain the poor results that these
researchers observed in terms of message population and lookahead, and
utilize this knowledge to improve performance.

The  "classical"  implementation  of  the  FCFS  process  uses  two  types  of 
events:  arrival  events  (scheduled  by  other  processes)  denote  jobs  arriving 
at the server, and departure events (scheduled by the FCFS process itself)
denote jobs completing service. The actions executed by the server
process for each event type are shown in figure 6. NJobs indicates the
number of jobs currently residing in the server, and ServiceTime indicates
the  time  required  to  service  each  job.  Code  for  computing  statistics  is  not 
shown. 

The classical server process has very poor lookahead properties. This is
because it will not transmit an arrival event message with timestamp TS
until it has first advanced its local simulated time clock to TS by
processing a departure event. In effect, it has a lookahead value of zero.

Figure 4. Overhead with non-uniform lookahead: deadlock avoidance.

Figure 5. Central server queueing model.

    ARRIVAL EVENT at TIME T:
        NJobs := NJobs + 1;
        IF (NJobs = 1) THEN   /* if server was previously idle */
            Schedule (local) Departure Event at time T + ServiceTime;

    DEPARTURE EVENT at TIME T:
        Schedule (remote) Arrival Event at time T;
        NJobs := NJobs - 1;
        IF (NJobs > 0) THEN   /* if job(s) waiting in queue */
            Schedule (local) Departure Event at time T + ServiceTime;

Figure 6. "Classical" program for FCFS server (poor lookahead).

The lookahead properties of the FCFS process can be improved by
eliminating the departure event and generating a new arrival event as soon
as one is received. Because an FCFS queueing discipline is used, the
departure time can be determined as soon as the message is received. The
optimized program is shown in figure 7. EndService denotes the time at
which the server process will become idle if no additional jobs are
received in the future. This program exhibits very good lookahead abilities
because it can schedule events far into the simulated time future.

    ARRIVAL EVENT at TIME T:
        IF (T < EndService) THEN   /* if server busy */
        BEGIN
            Schedule (remote) Arrival Event at time EndService + ServiceTime;
            EndService := EndService + ServiceTime;
        END
        ELSE   /* server idle */
        BEGIN
            Schedule (remote) Arrival Event at time T + ServiceTime;
            EndService := T + ServiceTime;
        END

Figure 7. Optimized program for FCFS server (good lookahead).

6.1 Performance Using Identical Servers

Simulators using each of these server programs were developed and
executed on the Butterfly testbed. In all of the experiments described
below, each logical process was mapped to a separate processor, and static
scheduling was used. Service times for server processes were selected
either deterministically or from a random variable with a negative
exponential distribution.

The resulting speedup and simulator efficiencies for the central server
queueing model using the deadlock detection and recovery strategy are
shown in figures 8 and 9, respectively. The deadlock avoidance simulator
yielded similar speedups. As can be seen, reprogramming the server to
have better lookahead characteristics dramatically improves performance.
Speedup is improved by as much as an order of magnitude. These results
are consistent with those obtained using synthetic workloads.

The performance results of the classical server process are qualitatively
similar to those reported by Reed and Seethalakshmi. The servers used in
those studies are a variation of the classical server described above, and
share the same (poor) lookahead properties: a message will not be
forwarded until another message is first received with a timestamp at least
as large as the departure time of the first. Therefore, lookahead provides
an explanation for the poor performance that they observed.
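The two server programs of figures 6 and 7 can be compared with a small Python sketch. It assumes a single server with a fixed service time and a known list of arrival timestamps; the function names are hypothetical, and a heapq priority queue stands in for the local event list. Both functions return the timestamps of the arrival events forwarded to the next process.

```python
import heapq


def classical_fcfs(arrivals, service_time):
    """Figure 6 style: a job is forwarded only when its departure event is processed."""
    events = [(t, "arrival") for t in arrivals]
    heapq.heapify(events)
    n_jobs = 0
    forwarded = []
    while events:
        t, kind = heapq.heappop(events)
        if kind == "arrival":
            n_jobs += 1
            if n_jobs == 1:  # server was previously idle
                heapq.heappush(events, (t + service_time, "departure"))
        else:
            forwarded.append(t)  # remote arrival event at time t
            n_jobs -= 1
            if n_jobs > 0:  # job(s) waiting in queue
                heapq.heappush(events, (t + service_time, "departure"))
    return forwarded


def optimized_fcfs(arrivals, service_time):
    """Figure 7 style: the forwarded timestamp is computed as soon as a job arrives."""
    end_service = 0.0
    forwarded = []
    for t in sorted(arrivals):
        if t < end_service:  # server busy
            end_service = end_service + service_time
        else:  # server idle
            end_service = t + service_time
        forwarded.append(end_service)
    return forwarded
```

Both programs forward the same timestamps. The difference is when they become known: the optimized server can emit each one at the moment the job arrives, while the classical server must first advance its clock to the departure time.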

Although the above results are encouraging, it is important to keep in
mind that reprogramming the application to exhibit greater lookahead
ability is not always possible. The above optimization relied on the servers
using an FCFS scheduling discipline. As we shall soon see, many
applications inherently contain poor lookahead properties.

Finally  we  note  that,  at  first  glance,  reprogramming  logical  processes  to 
maximize lookahead may complicate other aspects of the simulation, e.g.,
statistics collection. For example, the optimized server does not pause for
departure events, so statistics that are most easily collected at job departure
must  be  collected  at  other  points  in  simulated  time.  This  problem  is  easily 
reconciled  by  scheduling  local  departure  events  (as  was  done  before)  that 
are  only  used  for  statistics  collection  purposes. 

6.2 Performance Using Mixed Servers

Additional  experiments  were  performed  to  examine  the  effect  of  mixing 
processes  with  poor  and  good  lookahead  characteristics.  Recall  that 
experiments using synthetic workloads revealed that a small number of
processes with poor lookahead could significantly degrade performance of
the deadlock detection and recovery simulator. The deadlock avoidance
simulator  was  found  not  to  be  as  susceptible  to  such  behavior. 

The central server queueing network simulations were repeated where
one of the three servers was implemented using the classical server
program described earlier, and the remaining servers used the optimized
program. The resulting simulator is not unlike one that would result if one
of the servers was (say) a prioritized queue while the others were FCFS.

Figure 8. Speedup of central server queueing model.

Figure 9. Overhead of central server queueing network simulator.

Figure 10. Speedup of detection and recovery simulator with one classical server.

Figure 11. Overhead of detection and recovery simulator with one classical server.

The speedup and efficiency of the deadlock detection and recovery
simulator are shown in figures 10 and 11. When the central server (the
process receiving messages from the merge process) has poor lookahead
properties, performance is almost as poor as when all of the servers have
poor lookahead. When one of the secondary servers (the servers receiving
messages from the fork process) has poor lookahead, performance is
better, but still well below that of the simulator using only optimized
servers. These results are consistent with those obtained using synthetic
workloads, and demonstrate that a few processes with poor lookahead can
significantly degrade overall performance in the deadlock detection and
recovery simulator.

When the classical program was used to implement a secondary server,
the routing probabilities in the fork were modified so that 10, 50, and
finally 90 percent of the message traffic was routed to the classical server.
It is interesting to note that performance improves as more traffic is routed
toward the server with poor lookahead. If little traffic is directed toward
this server, the simulator deadlocks constantly: the merge process is forced
to block because it cannot determine whether it is safe to proceed without
first receiving a message from this server. Routing additional message
traffic toward this server helps the simulator to overcome (somewhat) the
server's poor lookahead characteristics.


Speedup and overhead curves for the deadlock avoidance simulator are
shown in figures 12 and 13. The deadlock avoidance simulator tends to be
more forgiving of processes with poor lookahead. Poor performance
results when the central server process has poor lookahead. However,
performance begins to approach that of the optimized simulator in some
situations where one of the secondary servers has poor lookahead. In
particular, good performance is obtained if a significant fraction of the
message traffic (50 to 90 percent) is routed around the process with poor
lookahead. Unlike in the deadlock detection and recovery simulator, null
message traffic is generated by the classical server to allow the merge
process to proceed. Because processes with poor lookahead tend to buffer
messages rather than immediately forwarding them, it is best to minimize
the amount of traffic routed to the classical server because this only
detracts from the available parallelism.
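
The blocking behavior of the merge process, and the way a null message releases it, can be sketched as follows. This is a minimal illustration of the usual conservative input rule (process the earliest waiting message only when no empty input link could still deliver an earlier one), not the code used in these experiments. The class name, method names, and the convention that a payload of None denotes a null message are all assumptions made for the example.

```python
from collections import deque


class MergeProcess:
    """Merge process with one FIFO queue and one clock per input link (hypothetical)."""

    def __init__(self, n_inputs):
        self.queues = [deque() for _ in range(n_inputs)]
        self.clocks = [0.0] * n_inputs  # timestamp of the last message seen on each link

    def receive(self, link, timestamp, payload=None):
        """Accept a message; payload None models a null message (assumed convention)."""
        assert timestamp >= self.clocks[link], "timestamps on a link must not decrease"
        self.clocks[link] = timestamp
        if payload is not None:
            self.queues[link].append((timestamp, payload))

    def next_safe(self):
        """Return the next (timestamp, payload) that is safe to process, or None if blocked."""
        candidates = [(q[0][0], i) for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        ts, link = min(candidates)
        # Blocked if some empty link could still deliver a message earlier than ts.
        for i, q in enumerate(self.queues):
            if not q and self.clocks[i] < ts:
                return None
        return self.queues[link].popleft()
```

With two input links, a job with timestamp 5.0 on link 0 cannot be processed while link 1 is empty with clock 0.0. A null message with timestamp 6.0 on link 1 carries no job but advances that link's clock, after which the job becomes safe to process.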

7. Communication Network Simulations

Simulations of the message passing subsystem of a hypothetical
multicomputer were also performed. The multicomputer is organized in a
hypercube topology, and Sullivan's algorithm is used to route messages to
their respective destinations [Sull77a]. Like the queueing network and
synthetic workload experiments, a fixed message population was used to
control the amount of available parallelism. Initially, each message is
assigned a destination to which it is to be routed, and a message length.
The destination is selected from a uniform distribution (excluding the
processor where the message initially resides), and the message length is
selected from an exponential distribution. When a message reaches its
final destination, a new destination and message length are selected. All
communication links in the hypercube are assumed to provide the same
bandwidth. Three simulators were developed that contain varying degrees
of lookahead, as will be described next.

Figure 12. Speedup of deadlock avoidance simulator with one classical server.
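
The workload rule described above (uniform destination excluding the current node, exponentially distributed length) might be sketched as below. The function name and the `mean_length` parameter are hypothetical; the mean message length used in the experiments is not given here.

```python
import random


def new_assignment(current_node, dimension, mean_length, rng):
    """Pick a new destination and message length for a message at current_node.

    Hypothetical sketch: mean_length is an assumed parameter, and rng is any
    random.Random instance.
    """
    n_nodes = 1 << dimension
    # Draw uniformly from the n_nodes - 1 other nodes by skipping over current_node.
    dest = rng.randrange(n_nodes - 1)
    if dest >= current_node:
        dest += 1
    # expovariate takes the rate, i.e. the reciprocal of the mean.
    length = rng.expovariate(1.0 / mean_length)
    return dest, length
```

Calling `new_assignment(3, 4, 100.0, random.Random(1))` returns a destination in the 16-node cube other than node 3, together with a nonnegative length.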

7.1 A Simulator with High Lookahead

FCFS  is  a  simulator  in  which  messages  are  simply  forwarded  on  the 
output link selected by the routing algorithm in FCFS order. Like the
FCFS  queueing  network  described  earlier,  this  simulator  has  great 
lookahead  ability  because  messages  arriving  at  a  logical  process  (with 
timestamp  denoting  the  arrival  time  in  the  hypercube)  can  be  immediately 
forwarded. 

7.2 A Simulator with Moderate Lookahead

PRIO is a simulator with intermediate lookahead properties. Here,
messages  are  classified  as  either  high  priority  or  low  priority. 
Communication  links  in  the  hypercube  give  preference  to  high  priority 
messages  when  selecting  the  next  message  to  be  transmitted.  A  low 
priority  message  is  only  forwarded  if  there  are  no  high  priority  messages 
waiting  to  use  the  link.  Messages  within  each  priority  level  are  processed 
in  FCFS  order.  Each  message  is  assigned  a  new  priority  whenever  a  new 
destination  address  and  message  length  are  selected  and  maintains  this 
priority  until  it  reaches  the  destination  processor. 

No  preemption  occurs  in  this  simulator.  Once  the  link  begins 
forwarding  a  low  priority  message,  it  will  continue  to  send  it,  even  if  a 
high  priority  message  arrives  before  transmission  is  complete. 
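
The link discipline being modeled (two priority levels, FCFS within each level, no preemption) can be sketched as a pair of queues. The class and method names are hypothetical; the sketch describes the simulated link, not the parallel simulator's logical process.

```python
from collections import deque


class NonPreemptivePriorityLink:
    """Two-level priority link, FCFS within each level (hypothetical sketch)."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def enqueue(self, message, high_priority):
        """Queue a message at its priority level."""
        (self.high if high_priority else self.low).append(message)

    def next_message(self):
        """Choose the next message to transmit; intended to be called only when the link is idle.

        A message already being transmitted is never interrupted, because the
        choice is made only between transmissions.
        """
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

A low priority message is returned only when the high priority queue is empty, which matches the rule stated above.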

The  parallel  simulator  for  this  system  has  intermediate  lookahead 
properties.  Logical  processes  have  excellent  lookahead  for  high  priority 
messages,  but  poorer  lookahead  for  those  with  low  priority.  Just  as  is  the 
case  for  the  FCFS  simulator,  high  priority  messages  can  be  forwarded  as 
soon  as  they  arrive  because  the  departure  time  can  be  immediately 
determined. However, a low priority message cannot be forwarded until
simulated time in the logical process has advanced to the departure time
(the time the hypercube begins sending the message) because it must first
be determined that no high priority message will receive service ahead of
it.

7.3 A Simulator with Poor Lookahead

The third simulator, PREEMPT, is identical to the PRIORITY simulator
except that high priority messages preempt service of low priority
messages. When a low priority message is preempted, it is assumed that
the message must be completely resent once no other high priority
messages remain that are waiting to use the link. The simulator for this
system cannot forward a message to another logical process until simulated
time has advanced to the arrival time (the time the tail of the message
reaches the receiving hypercube node), so it has even poorer lookahead
properties than the preceding simulator.

Figure 13. Overhead of deadlock avoidance simulator with one classical server.
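
The difference between the three simulators can be summarized by asking how far ahead of its local clock a logical process can stamp a forwarded message. The sketch below encodes the forwarding rules stated in sections 7.1 to 7.3. The function names and parameters are hypothetical. It assumes that a forwarded message carries the time at which it arrives at the next node, and it applies the PREEMPT rule to every message regardless of priority, which is one reading of the description above.

```python
def release_time(policy, high_priority, received, begin_send, transmit_time):
    """Local simulated time at which the logical process may forward the message."""
    if policy == "FCFS":
        return received                       # forwarded as soon as it arrives
    if policy == "PRIO":
        # High priority: forwarded on arrival. Low priority: held until sending begins.
        return received if high_priority else begin_send
    if policy == "PREEMPT":
        return begin_send + transmit_time     # held until the tail reaches the next node
    raise ValueError(f"unknown policy: {policy}")


def lookahead(policy, high_priority, received, begin_send, transmit_time):
    """Timestamp of the forwarded message minus the local time at which it is sent."""
    arrival_at_next_node = begin_send + transmit_time
    return arrival_at_next_node - release_time(
        policy, high_priority, received, begin_send, transmit_time
    )
```

For a message received at time 10 that begins transmission at time 12 and takes 3 time units to send, this gives a lookahead of 5 for FCFS and for high priority PRIO messages, 3 for low priority PRIO messages, and 0 for PREEMPT.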

7.4 Performance Results

The hypercube simulations were performed on the Butterfly, and
compared with execution of the sequential event list implementation.
Unlike the previous experiments, these were performed on the Butterfly
Plus, an upgraded version of the Butterfly that features 32 bit data paths
(the original Butterfly has 16 bit data paths). The switch remains the same,
so this effectively increases the cost of interprocessor communications.
Because the simulation testbed already minimizes interprocessor
communication, no program modifications were required. Experiments
indicated that this hardware modification did not significantly affect the
speedup measures derived earlier.

Overhead for these three simulators is shown in figures 14 and 15 for
hypercubes of dimensions 4 and 6 (16 and 64 nodes respectively). Eight
processors were used in these experiments. Upon reaching its destination,
each message is assigned a high priority with probability P. In these
experiments, P was selected to be either 0.01 or 0.50.

As predicted, the observed overhead steadily increases as the lookahead
properties of the simulation are diminished. This is reflected in higher null
message ratios in the deadlock avoidance simulator, and a larger message
population required to induce avalanche in the detection and recovery
simulator. Overheads are generally lower in the dimension four hypercube
than the cube of dimension six for a fixed message population (as
measured in messages per process) because there are fewer
communication links; the simulators operate at peak efficiency when there
is at least one message on each incoming link because no blocking occurs.
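
The link-count argument can be made concrete with a small calculation. The sketch assumes one logical process per hypercube node, which is not stated explicitly in this section; the function names are hypothetical. A node in a hypercube of dimension d has d neighbors, and therefore d incoming links.

```python
def hypercube_links(dimension):
    """Return (nodes, incoming links per node, total directed links)."""
    nodes = 1 << dimension
    incoming_per_node = dimension      # one incoming link from each neighbor
    return nodes, incoming_per_node, nodes * incoming_per_node


def messages_per_incoming_link(messages_per_lp, dimension):
    """Average messages per incoming link, assuming one logical process per node."""
    return messages_per_lp / dimension
```

Under this assumption the dimension four cube has 16 nodes and 64 directed links, and the dimension six cube has 64 nodes and 384. A population of 12 messages per process averages 3 messages per incoming link in the smaller cube but only 2 in the larger one, so empty incoming links, and hence blocking, are more likely in the larger cube.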

The lookahead properties of the simulator increase as P increases
because more high priority messages are generated that can be forwarded
as soon as they are received. This explains the lower overheads that were
observed when P was increased.

Speedup curves for the hypercube simulators are shown in figures 16
and 17. Using eight processors, the parallel simulator executed anywhere
from 5.7 times faster to nearly 20 times slower than the splay tree
simulator, depending on the lookahead properties of the application. Some
data points for very high message populations are missing because
insufficient memory was available on a single processor to conduct an
event list simulation.

The hypercube simulations provide additional evidence to support our
contention that lookahead properties of the application are crucial to
obtaining efficient performance for simulators using the deadlock
avoidance and deadlock detection and recovery strategies. While the
queueing network simulations demonstrated that it is possible to obtain
dramatic speedups by reprogramming the simulation to fully exploit its
lookahead properties, these experiments demonstrate that some simulations
inherently contain poor lookahead, and cannot be improved by
reprogramming. Such simulations appear to be poorly suited for the
conservative simulation algorithms using deadlock avoidance and
deadlock detection and recovery techniques, except in a few special
circumstances such as networks that contain no feedback loops.

Figure 14. Overhead in hypercube simulator using deadlock recovery.

Figure 15. Overhead in hypercube simulator using deadlock avoidance.
[Plot: null message ratio for the hypercube simulator on 8 processors,
versus message population (messages per LP). Curves are given for 16 and
64 nodes: FCFS, PRIO (no preemption) with P = 0.50 and P = 0.01, and
PREEMPT with P = 0.50 and P = 0.01.]

Figure 16. Speedup of hypercube simulator using deadlock recovery.

Figure 17. Speedup of hypercube simulator using deadlock avoidance.


8.  A  Perspective  on  Lookahead:  Non-Events 

The influence of lookahead on performance can be viewed from another
perspective: processes with very good lookahead ability are able to act in a
largely autonomous fashion; their behavior is not heavily influenced by the
activities of other processes, so they can perform simulation work at "full
speed," limited only by the rate at which they can be fed work, and the
number of CPU cycles (or other resources) that they can obtain. The
optimized queueing network server process is a good example of such
autonomous behavior.

On the other hand, processes with poor lookahead ability must
frequently obtain additional information from other processes before they
can safely proceed. This is unfortunate because not only must such
processes wait for real events to be generated by other processes
(corresponding to data dependencies that cannot be circumvented), but
often they must also wait to be sure other events will not occur. The fact
that an airplane will not crash and close the airport in the next moment of
simulated time must be discovered before the airport process can go about
its business of deciding what will happen next. We call these "phantom"
events that never materialize non-events. Chandy and Misra recently
captured these notions in an elegant formalism called conditional and
unconditional knowledge [Chan87a].


In the deadlock avoidance simulator, knowledge of non-events is passed
explicitly through the use of null messages. In the deadlock detection and
recovery simulator, this information is obtained by system deadlock:
processes with messages waiting to be processed must wait until they can
be certain that specific events will not occur. Certainty as to the
eventuality of non-events comes about when the deadlock is broken, and
the deadlock resolution protocol is invoked. Sequential, event list
simulators incur little or no overhead for non-events.

If non-events are possible, but occur infrequently, the simulator is often
forced to wait needlessly, leading to very poor performance. The
hypercube simulator containing preemption and few high priority messages
is one example of such behavior. Optimistic simulation methods such as
Time Warp appear to offer the greatest potential for addressing this
problem, if the associated state saving and rollback overheads can be
overcome.

9. Conclusions

Extensive  empirical  performance  evaluations  of  distributed  simulation 
programs  were  performed  using  the  deadlock  avoidance  and  deadlock 
detection  and  recovery  algorithms  developed  by  Chandy  and  Misra.  The 
principal  results  of  these  studies  are: 


• The lookahead ability of logical processes plays a critical role in
determining the efficiency of the deadlock avoidance and deadlock
detection and recovery algorithms. This is attributed to the fact that
processes must spend an excessive amount of time waiting to be sure that
certain events will not occur if their lookahead ability is poor.

• Message avalanche was observed in the deadlock detection and recovery
simulator for moderate to high message populations, and was necessary
to achieve efficient execution. The poorer the lookahead ability of a
process, the larger the message population necessary to achieve
avalanche. If lookahead is sufficiently poor, avalanche may never be
observed for workloads of practical interest.

• Deadlock detection and recovery simulators containing different types of
logical processes can be adversely affected by a small number of
processes that exhibit poor lookahead ability. The existence of a few
such processes can greatly increase the message population necessary to
achieve avalanche, even if many other processes contain very good
lookahead properties. The deadlock avoidance simulator is not as
severely affected by this behavior if the bulk of the simulation activity
avoids processes with poor lookahead.

• Queueing networks that contain cycles, previously thought to be
ill-suited for conservative distributed simulation algorithms, can achieve
good performance if servers are reprogrammed to take advantage of all
available lookahead.

•  Simulation  applications  such  as  those  containing  infrequent  preemptive 
events  inherently  have  poor  lookahead  properties,  and  appear  ill-suited 
for  these  algorithms.  Applications  containing  state  dependent  behavior 
(e.g.,  load  balancing  mechanisms)  similarly  contain  moderate  to  poor 
lookahead  properties. 

•  Simulations  of  several  hypercube-based  communication  networks  with 
varying  degrees  of  lookahead  provide  empirical  data  to  support  the  above 
conclusions. 

These  studies  demonstrate  that  parallel  simulation  algorithms  can 
achieve  significant  speedups  over  sequential  event  list  implementations  if  a 
moderate  to  high  degree  of  parallelism  is  present,  even  if  there  are  many 
feedback  loops  in  the  logical  process  topology.  However,  good  lookahead 
properties  are  essential  to  obtaining  good  performance  in  simulations  using 
deadlock  avoidance  or  deadlock  detection  techniques.  The  fact  that  a  few 
processes  with  poor  lookahead  properties  can  significantly  degrade 
performance  also  limits  the  usefulness  of  these  approaches. 

Because  conservative  simulation  algorithms  must  continually  predict 
what  will  not  happen  in  order  to  be  able  to  safely  proceed,  these  studies 
raise  considerable  doubt  as  to  whether  any  conservative  parallel  simulation 
algorithm  can  obtain  significant  speedup  in  applications  containing  poor 
lookahead  properties.  In  these  situations,  optimistic  simulation  algorithms 
such  as  Time  Warp  appear  to  offer  much  greater  potential  for  achieving 
significant  speedups. 
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